[SPARK-16174][SQL] Improve OptimizeIn optimizer to remove literal repetitions #13876
Test build #61126 has finished for PR 13876 at commit
|
@@ -793,6 +794,20 @@ object ConstantFolding extends Rule[LogicalPlan] {
}

/**
 * Removes literal repetitions from IN predicate
 */
object RemoveLiteralRepetitionFromIn extends Rule[LogicalPlan] {
can this just go into OptimizeIn
also does this need to be literal specific
Thank you for the review, @rxin.
- Sure, I can merge this into OptimizeIn.
- Also, it can be used for deterministic expressions.
I'm only focusing on literals here. May I handle both cases?
yea why don't we handle both
Sure! No problem.
Test build #61136 has finished for PR 13876 at commit
|
Test build #61135 has finished for PR 13876 at commit
|
Test build #61137 has finished for PR 13876 at commit
|
Hi, @rxin . |
Hi, @rxin . |
Test build #61232 has finished for PR 13876 at commit
|
Hi, @rxin . |
Test build #61289 has finished for PR 13876 at commit
|
Hi, @rxin . |
Hi, @rxin . |
OptimizeIn
optimizer to remove deterministic repetitions
Test build #61563 has finished for PR 13876 at commit
|
Hi, @rxin . |
Rebased to the master. |
Test build #61662 has finished for PR 13876 at commit
|
Test build #61760 has finished for PR 13876 at commit
|
Hi, @rxin . |
    val hSet = list.map(e => e.eval(EmptyRow))
    InSet(v, HashSet() ++ hSet)
  case i @ In(v, list) =>
    val (deterministics, others) = list.partition(_.deterministic)
one question i have is how often do we see an in expression with some expressions being deterministic and some nondeterministic? if not, i'd just simplify this so we only do it if everything is deterministic.
I agree. In real situations, case i @ In(v, list) if list.forall(_.deterministic) will cover most cases.
I'll update it like that. Thank you for the review again!
Test build #61860 has finished for PR 13876 at commit
|
Hi, @rxin . |
} else if (newList.length < list.length) {
  i.copy(v, newList)
} else { // newList.length == list.length
  i
I think this can bring some performance concerns because we are doing a lot of work in order to return the original query, and given the optimizer is iterative, it would spend a lot of cycles just doing this.
Can we introduce a flag (lazy val) to the In expression to check whether it is optimizable? If it is not, then we shouldn't even go into the case. Something like
case class In(...) {
lazy val inSetConvertable: Boolean = list.forall(_.deterministic)
}
Sure. That sounds great. I'll fix soon.
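The flag idea above can be sketched in isolation. This is a toy model, not Spark's Catalyst classes: `Expr`, `In`, `rewrite`, and the `evaluations` counter are hypothetical names, used only to illustrate that a `lazy val` guard is computed once per expression instance, so repeated optimizer passes over an `In` that cannot be optimized skip the rule body cheaply.

```scala
object LazyFlagSketch {
  // Counts how often the guard body actually runs (illustration only).
  var evaluations = 0

  // Hypothetical stand-in for a Catalyst expression.
  final case class Expr(sql: String, deterministic: Boolean)

  final case class In(value: Expr, list: Seq[Expr]) {
    // Evaluated at most once per In instance, then cached.
    lazy val inSetConvertible: Boolean = {
      evaluations += 1
      list.forall(_.deterministic)
    }
  }

  // A rule that only enters its rewrite branch when the flag is true.
  def rewrite(in: In): In = in match {
    case i if i.inSetConvertible => i.copy(list = i.list.distinct)
    case other                   => other // returned untouched, no extra work
  }
}
```

Calling `rewrite` several times on the same non-convertible instance returns that instance each time and runs the `forall` check only on the first call.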
@@ -132,6 +132,7 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate
}

override def children: Seq[Expression] = value +: list
lazy val inSetConvertible = children.forall(_.deterministic)
my bad - we should put newList.forall(_.isInstanceOf[Literal]) here too
mark the type explicitly since this is a public function
Oh, then the semantics are different. What you mean is just improving InSet.
My original PR was about removing all deterministic duplicates.
But, if that is your intention, Okay.
Ah, we need to update the PR and JIRA descriptions, too.
yup.
Test build #61875 has finished for PR 13876 at commit
|
Title changed: "OptimizeIn optimizer to remove deterministic repetitions" to "OptimizeIn optimizer to remove literal repetitions"
Now the scope of the PR is reduced a lot, but I hope it still covers the majority of real queries.
 */
case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
      case In(v, list) if !list.exists(!_.isInstanceOf[Literal]) &&
Oh, here is a regression. Originally, v could be non-deterministic.
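One reading of this regression, sketched with a toy model rather than Spark's real classes: a guard over `children` (which includes the tested value `v`) rejects an `In` whose value is non-deterministic, while a guard over `list` alone still accepts it. `Expr`, `guardOverChildren`, and `guardOverList` are hypothetical names for illustration.

```scala
object GuardScopeSketch {
  // Hypothetical stand-in for a Catalyst expression.
  final case class Expr(sql: String, deterministic: Boolean, isLiteral: Boolean)

  final case class In(value: Expr, list: Seq[Expr]) {
    def children: Seq[Expr] = value +: list

    // Also inspects `value`, so a non-deterministic `v` disables the rule.
    def guardOverChildren: Boolean = children.forall(_.deterministic)

    // Inspects only the IN list, matching the literal-only check in the diff.
    def guardOverList: Boolean = list.forall(_.isLiteral)
  }

  val rand = Expr("rand()", deterministic = false, isLiteral = false)
  val one  = Expr("1", deterministic = true, isLiteral = true)
  val two  = Expr("2", deterministic = true, isLiteral = true)

  // rand() IN (1, 2): the list is all literals, the value is non-deterministic.
  val example = In(rand, Seq(one, two))
}
```

For `example`, the list-only guard holds while the children-wide guard does not, which is the behavioural difference the comment points at.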
Test build #61878 has finished for PR 13876 at commit
|
Test build #61887 has finished for PR 13876 at commit
|
Test build #61888 has finished for PR 13876 at commit
|
  val hSet = newList.map(e => e.eval(EmptyRow))
  InSet(v, HashSet() ++ hSet)
} else if (newList.size < list.size) {
  expr.copy(value = v, list = newList)
you don't need to copy value here, do you?
Right, it's a whole value. We had better create the In expression.
sorry what i meant was ... we only need to do
expr.copy(list = newList)
Oops. My bad!
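The point about `copy` is plain Scala case-class behaviour and can be checked on a toy class. The `In` below is a hypothetical two-field case class, not Catalyst's: naming only the changed field carries the other fields over unchanged.

```scala
object CopySketch {
  // Hypothetical two-field case class standing in for the In expression.
  final case class In(value: String, list: Seq[Int])

  val expr    = In("a", Seq(1, 1, 2))
  val newList = expr.list.distinct

  // Only the changed field is named; `value` is reused as-is.
  val short = expr.copy(list = newList)

  // Equivalent but redundant: `value` is passed back unchanged.
  val verbose = expr.copy(value = expr.value, list = newList)
}
```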
Looks pretty good. cc @cloud-fan for another look.
LGTM
Thank you for the review, @cloud-fan.
Test build #61898 has finished for PR 13876 at commit
|
There is one failure in
|
Retest this please. |
Test build #61897 has finished for PR 13876 at commit
|
Test build #61901 has finished for PR 13876 at commit
|
This time, it passed as expected.
thanks, merging to master!
Thank you for reviewing and merging, @cloud-fan and @rxin.
What changes were proposed in this pull request?
This PR improves the OptimizeIn optimizer to remove literal repetitions from SQL IN predicates. This optimizer prevents user mistakes and can also optimize some queries like TPCDS-36.
Before
After
How was this patch tested?
Pass the Jenkins tests (including a new testcase).
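As a rough illustration of the rewrite described in this PR, here is a self-contained toy model. It is a sketch under assumptions, not the merged Spark code: `Expr`, `Attr`, `Lit`, `In`, `InSet`, and `optimizeIn` are simplified stand-ins, `threshold` plays the role of the configured InSet conversion threshold, and `distinct` stands in for whatever de-duplication the real rule uses.

```scala
object InDedupSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class In(value: Expr, list: Seq[Expr]) extends Expr {
    // True only when every list element is a literal.
    lazy val inSetConvertible: Boolean = list.forall(_.isInstanceOf[Lit])
  }
  final case class InSet(value: Expr, set: Set[Any]) extends Expr

  def optimizeIn(e: Expr, threshold: Int): Expr = e match {
    case in @ In(v, list) if in.inSetConvertible =>
      val newList = list.distinct // drop repeated literals
      if (newList.size > threshold) {
        // Large lists become a hash-set membership test.
        InSet(v, newList.collect { case Lit(x) => x }.toSet)
      } else if (newList.size < list.size) {
        in.copy(list = newList) // shorter list, same predicate
      } else {
        in // nothing to remove, keep the original node
      }
    case other => other
  }
}
```

With a threshold of 10, `a IN (1, 1, 2)` becomes `a IN (1, 2)`; with a threshold of 2, `a IN (1, 2, 3)` becomes an `InSet` over the three values.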